ABSTRACT

Four recent indices emphasizing the
interrelationships of score distribution shape, modality, mean, and
variance were investigated to determine the reliability of mastery
tests. Attention was focused on the values of the indices when the
cutoff score was near to or far from the modes of the distribution. Five
types of score distributions were examined: bell shaped; highly
negatively skewed unimodal; bimodal with a stronger mode at the upper
end; symmetric bimodal with modes well separated; and symmetric
bimodal with modes near each other. Indices examined were based on:
(1) Brennan and Kane's index of dependability; (2) Huynh's single
administration estimate of the kappa coefficient of agreement; (3)
the mean split-half coefficient of agreement, a revision of an
earlier formulation by Marshall and Haertel; and (4) Subkoviak's
single administration estimate of the coefficient of agreement.
(Results are discussed and an explanation of symbols and formulas is
appended.) (HH)
Purpose



The issue of how to determine mastery test reliability is less than fully
settled, particularly when it comes to the issue of which index to use. The
purpose of this study was to shed some quasi-empirical light on the subject by
examining four relatively recent indices, with attention to the interrelation-
ships of score distribution shape, modality, and proximity of mastery cutoff
score to areas of heavy score density (modes). The four single-administration
indices examined were (sometimes variations or revisions of) those due to
Brennan and Kane (1977), Huynh (1976), Marshall and Haertel (1975), and
Subkoviak (1976a).

Many investigators in this field hold that mastery test reliability
should deal with the consistency of mastery/nonmastery decisions, or of
allocation to mastery state, rather than the classical notion of consistency
of score itself. Fluctuations in an individual's actual score in parallel or
repeated testing situations are not considered important, it is claimed,
unless they also result in inconsistent mastery state categorizations. Yet,
viewing the situation practically, one is hard pressed not to conclude that
scores grouped near the cutoff should somehow contribute less to the mastery
test's reliability than do those that are more distant from the cutoff.
The attention in this study was focused on the values of those indices when
the cutoff score was near to or far from the mode(s) of the distribution.
Procedure

A computer program, designed and written by the authors, generated item-
by-examinee response (correct or incorrect) matrices, according to parameters
selected to control score distribution shape, modality, mean, and variance.
From each matrix, the score distribution was obtained, and test indices were
calculated for all integral cutoff scores. Index value as a function of
cutoff score was graphed, as was the relative frequency distribution of scores,
so that score distribution shape and modality, cutoff score, and index value
could be visually compared. The rationale for doing this was that, for a
given score distribution shape, index values should be relatively lower when
cutoff score is near a mode and relatively higher when cutoff score is in an
area of very light score density, if the index is to reflect the property
mentioned earlier.
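The generation step described above can be illustrated with a minimal Python sketch. The authors' program and its parameters are not described in detail, so everything here is an assumption made for illustration: the function names `simulate_scores` and `score_distribution`, the use of a fixed success probability per examinee, and the two-group population that produces a roughly bimodal score distribution.

```python
import random
from collections import Counter

def simulate_scores(n_items, abilities, seed=0):
    """Return (response matrix, score list) for examinees whose per-item
    success probabilities are given in `abilities` (hypothetical model)."""
    rng = random.Random(seed)
    # One row per examinee, one 0/1 entry per item (1 = correct).
    matrix = [[1 if rng.random() < theta else 0 for _ in range(n_items)]
              for theta in abilities]
    scores = [sum(row) for row in matrix]
    return matrix, scores

def score_distribution(scores, n_items):
    """Frequency f(x) of each possible score x = 0..n_items."""
    counts = Counter(scores)
    return [counts.get(x, 0) for x in range(n_items + 1)]

# Hypothetical population: 40 low-ability and 60 high-ability examinees,
# chosen only to give a bimodal shape with a stronger upper mode.
abilities = [0.25] * 40 + [0.85] * 60
matrix, scores = simulate_scores(10, abilities, seed=1)
f = score_distribution(scores, 10)
```

The list `f` plays the role of the obtained score distribution from which each index would then be calculated at every integral cutoff score.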

Five types of score distributions were investigated: bell-shaped,
highly negatively skewed unimodal, bimodal with a stronger mode at the
upper end, symmetric bimodal with modes well separated, and
symmetric bimodal with modes near each other. These shapes were only
approximately obtained, since the computer program has built-in random error
components in order to simulate the results of actual test-taking situations.
Each score distribution shape was investigated for tests of 5, 10, and 20
items.

Indices²

A. Index of dependability, M(C) (Brennan and Kane, 1977)

Brennan and Kane choose not to call this a reliability index, for reasons
discussed in the reference cited. It was included in this study, however, in
order to see whether it shared any properties with the others. The index is
similar to that of Livingston (1972), but is based on generalizability theory
rather than classical test theory. The index is defined in terms of expected
squared deviations from the cutoff score, C.

B. 1. A single-administration estimate (Huynh, 1976) of the kappa coefficient,
κ (Cohen, 1960).

This estimate assumes that true scores follow a beta distribution with
parameters estimated from the mean and variance of the observed score distri-
bution; responses on parallel tests are independent and follow the binomial
error model.

2. A single-administration estimate of the coefficient of agreement (pro-
portion of consistent decisions), p (Huynh, 1976).
The same assumptions apply here.

C. Another single-administration estimate of the coefficient of agreement,
P (Subkoviak, 1976).
This index is based on the assumption that the probability that each
person is assigned to the same mastery state on parallel tests follows a
binomial form, and incorporates a true score for each person estimated via
linear regression using observed score, and observed score mean and variance.

1 It is recognized that multiple criteria may be employed in evaluating
reliability indices. In this paper, however, the criterion addressed is
whether the indices reflect score distribution mode(s).

2 Appendix A contains all computing formulas used.

D. The mean split-half coefficient of agreement, β (Marshall, 1976),
which is a revision of an earlier formulation (Marshall and Haertel, 1975).

This index is equal to the mean (over persons) proportion (over all
possible test splits) of consistent mastery decisions on a hypothetical
double-length test, scores on which can be estimated in a variety of ways.
Five different methods, or models, for estimating double-length test scores
or score distributions were used in this study, and are outlined in Appendix
A.

E. In addition, four of the above indices (κ, p, P, and β; see Huynh, 1978;
Subkoviak, 1977; Marshall, 1976) can be generalized to multiple mastery
states (more than one cutoff score). These generalized indices were also
investigated in this study, but only for the case of three mastery states.

F. Because of the close association between κ and the fourfold correlation
index (phi coefficient), φ was also calculated on the basis of quantities
generated in the calculation of coefficient β, in order that κ
might be compared with φ for each model.

Results 

Although the study generated a great deal of data, the focus reported³
here is on the degree to which the indices reflect score distribution modes.
1. The index of dependability, M(C), is clearly different from the others,
and the conclusion is that it measures quite different things. It did not
reflect score distribution modes (except, of course, when the distribution
was unimodal and the mean and mode coincided, since, as Brennan and Kane
indicate, M(C) always has a minimum at X̄). In fact, M(C) shows the same rela-
tionship to KR21 as Livingston's index does to KR20: the minimum value of
each coefficient, occurring at the score distribution mean, equals the
respective Kuder-Richardson estimate.

3 A more complete report will soon be available and may be obtained by
writing either of the authors.

2. Coefficient κ also measures very different things than do p, P, and β,
as its formulation suggests. Not only did it not reflect score distribution
modes (except, by coincidence, when the distribution was bimodal at the
extremes), but it behaved in the very opposite way for unimodal distributions,
having a maximum rather than a minimum at the mode for symmetric distributions
and near the mode for skewed distributions. This is because κ takes on its
maximum value in the vicinity of the test mean.

3. Huynh's p did reflect the score modes for unimodal distributions, that is,
distributions which approximate one of the beta family, in accordance with the
assumptions for that index. The coefficient did not, however, reflect score
modes when the shape of the score distribution was bimodal, which is often the
case for mastery tests, unless the modes were so extreme as to copy one of the
J-shaped or U-shaped beta distributions (a situation which is not likely to
happen in the real world, particularly when guessing occurs). Based on the
research of this study, the authors hypothesize that the p coefficient would
fare better on this criterion if Huynh had chosen a predictive Bayesian beta-
binomial approach (Aitchison and Dunsmore, 1975), akin to D, 1. in Appendix A,
even though that approach is slightly more complex. Although earlier research
(Subkoviak, 1978) recommended the Huynh procedure, it should be noted that
Subkoviak's study dealt only with unimodal distributions closely approximating
a beta distribution. It is likely that the recommendations would have been
otherwise had bimodal distributions similar to those in this study been investi-
gated.

4. Subkoviak's P generally reflected score modes very well, for both
unimodal and bimodal distributions. The one exception was when the distribution
was bimodal and the modes were close together, particularly for short tests.⁴
But since this type of score distribution is atypical, the Subkoviak approach is,
overall, highly satisfactory.

5. Of the five estimation models for Marshall's coefficient β, model 3 was the
least satisfactory, for reasons to be discussed in part 7 of this section.
Model 1 reflected score modes, but uniformly not as well as did models 2, 4,
and 5, except when the distribution was unimodal. Models 2 and 4 were nearly
identical, the only exception being the situation described above for Subkoviak's
P. Model 4 is thus preferable of the two for reasons of simplicity. Thus
the choice narrows down to models 4 and 5. Model 5 yielded better results in
the situation described above, and slightly better results when the two modes
are widely apart. Model 4 yielded better results for short tests for the
asymmetric bimodal distribution. Other than that, the two models were comparable
and yielded very satisfactory results vis-a-vis the mode reflection criterion.

4 In this situation, a compound rather than a simple binomial model would be
better; in all other cases the simple and compound binomial models yielded
nearly identical results, supporting Subkoviak's (1978) findings.

6. For the three-mastery-state indices, a trimodal distribution was constructed;
p, P, and β models 1, 3, 4, and 5 were calculated for various combinations of
cutoff scores. Of these, p and β model 3 did not reflect score modes, while the
other three models did. Interpretation is more difficult, however, since the graphs
involved should be three-dimensional (index value as a function of the two cutoff
scores), and the computer program was not set up to handle this situation. The
authors plan to research this topic further.

7. This study produced another significant finding, which might have been but
was not deduced mathematically beforehand, and thus rendered an element of
surprise. Although the following results have not yet been proved rigorously
(the authors are working on it), the computer-calculated empirical evidence is
so overwhelming that we feel secure in claiming the following conjectures:

i) Since p requires assumptions about a beta-binomial distribution, if
analogous assumptions (for a double- rather than single-length test) are postu-
lated for coefficient β, model 3, the two indices are identical, i.e., p = β₃.
This explains why coefficient β, model 3, had unsatisfactory characteristics.
This conjecture (and the two to follow) is backed up by over 300 pairs of
calculated index values identical to three decimal places, over all ranges of
cutoff score, distribution type, and test length. Moreover, one would suspect
that if p used the predictive Bayesian formulation as suggested earlier, it
would under these conditions equal β₁.

ii) Since κ requires the same assumptions as does p, when the phi
coefficient is calculated according to the formula in Appendix A under model
3, κ = φ₃.

iii) Since P entails assumptions about binomial error and a regression
estimate of true score, if analogous (for 2n items) assumptions are made for
coefficient β, model 4, the two indices are identical, i.e., P = β₄. It is
further hypothesized that if the compound binomial model were used for P,
then under these conditions P would equal β₂.

It is apparent, then, that the question of whether to employ the Huynh p
or Subkoviak P or Marshall β is not relevant, since each of the first two
is a special instance of the third, more general coefficient, when the
appropriate assumptions are inserted. The question instead should be which
set of assumptions is appropriate for the situation.
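Conjecture iii lends itself to a quick numerical check. The Python sketch below is an illustration under stated assumptions, not the authors' program: the function names, the example frequencies, and the value 0.6 used for the reliability weight are all hypothetical. It computes Subkoviak's P from an observed score distribution, builds the model 4 (binomial regression) double-length distribution, and evaluates the mean split-half agreement by direct enumeration of half-test scores.

```python
from math import comb

def binom_pmf(k, m, theta):
    """Binomial probability of k successes in m trials."""
    return comb(m, k) * theta ** k * (1 - theta) ** (m - k)

def theta_hat(x, n, mean, alpha):
    """Regression estimate of true (proportion) score for observed score x."""
    return alpha * x / n + (1 - alpha) * mean / n

def subkoviak_P(f, C, alpha):
    """Single-administration agreement estimate from frequencies f[0..n]."""
    n, N = len(f) - 1, sum(f)
    mean = sum(x * fx for x, fx in enumerate(f)) / N
    total = 0.0
    for x, fx in enumerate(f):
        th = theta_hat(x, n, mean, alpha)
        q = sum(binom_pmf(j, n, th) for j in range(C, n + 1))  # P(mastery)
        total += fx * (q * q + (1 - q) ** 2)
    return total / N

def split_agreement(y, n, C):
    """Proportion of half-splits of a 2n-item test with total score y
    that place both halves in the same mastery state."""
    ways = sum(comb(y, j) * comb(2 * n - y, n - j)
               for j in range(max(0, y - n), min(y, n) + 1)
               if (j >= C) == (y - j >= C))
    return ways / comb(2 * n, n)

def beta_model4(f, C, alpha):
    """Mean split-half agreement with f(y) from binomial regression."""
    n, N = len(f) - 1, sum(f)
    mean = sum(x * fx for x, fx in enumerate(f)) / N
    total = 0.0
    for y in range(2 * n + 1):
        fy = sum(fx * binom_pmf(y, 2 * n, theta_hat(x, n, mean, alpha))
                 for x, fx in enumerate(f))
        total += fy * split_agreement(y, n, C)
    return total / N

f = [2, 6, 9, 5, 3, 8, 12, 5]   # hypothetical bimodal frequencies, n = 7
gaps = [abs(subkoviak_P(f, C, 0.6) - beta_model4(f, C, 0.6))
        for C in range(1, 8)]
```

In this sketch the two values agree to within floating-point error at every cutoff. One plausible reason is that, under a simple binomial model for 2n items, the two half-test scores are independent binomials whose distribution given the total is hypergeometric, so averaging the split agreement over f(y) reproduces the squared-probability form of P.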
Appendix A: Symbols and Formulas



Different authors use different symbols for the same thing. In order to 
minimize confusion, we have in this paper used a set of symbols that are 
as close as feasible to the authors' originals yet which are in common 
usage and have the same meaning throughout. If this is a compromise, we 
hope it is a compromise in favor of consistency and clarity. 

Symbols 

In what follows, these symbols have a common meaning: 

n = number of test items

N = number of persons

x = an obtained test score, 0 ≤ x ≤ n

f(x) = frequency of score x in the obtained score distribution

X̄ = test mean

S² = score variance

C = mastery cutoff score (where x ≥ C denotes "mastery"), 0 < C ≤ n

α₂₁ = Kuder-Richardson formula 21

α₂₀ = Kuder-Richardson formula 20

Computing Formulas 

A. M(C) (Brennan & Kane, 1977)

$$M(C) = 1 - \frac{1}{n-1}\cdot\frac{\bar{X}(n-\bar{X}) - S^2}{(\bar{X}-C)^2 + S^2}$$
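As an illustration of formula A, the following Python sketch evaluates M(C) from a list of raw scores and compares its value at C = X̄ with KR-21, the relationship noted in the Results. The function names and the sample scores are hypothetical, and the variance is taken with divisor N, which is an assumption since the source does not say which divisor was used.

```python
def mean_var(scores):
    """Mean and variance (divisor N, an assumption) of raw scores."""
    N = len(scores)
    mean = sum(scores) / N
    var = sum((x - mean) ** 2 for x in scores) / N
    return mean, var

def dependability(scores, n, C):
    """M(C) for an n-item test at cutoff C."""
    mean, var = mean_var(scores)
    return 1 - (mean * (n - mean) - var) / ((n - 1) * ((mean - C) ** 2 + var))

def kr21(scores, n):
    """Kuder-Richardson formula 21."""
    mean, var = mean_var(scores)
    return n / (n - 1) * (1 - mean * (n - mean) / (n * var))

scores = [2, 4, 5, 7, 8, 9, 9, 10]   # hypothetical scores on a 10-item test
mean, _ = mean_var(scores)
m_at_mean = dependability(scores, 10, mean)
m_by_cutoff = [dependability(scores, 10, C) for C in range(1, 11)]
```

For these scores `m_at_mean` coincides with `kr21(scores, 10)`, and every entry of `m_by_cutoff` is at least that large, which is why M(C) tracks the mean rather than the modes.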



B. 1. κ (Huynh, 1976; Cohen, 1960)

$$\kappa = \frac{p_{11} - p_1^2}{p_1 - p_1^2}$$

where $p_{11} = \sum_{x,x'=C}^{n} h(x,x')$

and $p_1 = \sum_{x=C}^{n} h(x)$.

Here, h(x) is the univariate negative hypergeometric density,
$h(x) = \binom{n}{x} B(a+x,\; n+b-x)/B(a,b)$, and h(x,x') is the bivariate density,
$h(x,x') = \binom{n}{x}\binom{n}{x'} B(a+x+x',\; 2n+b-x-x')/B(a,b)$, in which B represents the beta
function and a and b are parameters estimated by

$$a = \bar{X}\,\frac{1-\alpha_{21}}{\alpha_{21}}, \qquad b = (n-\bar{X})\,\frac{1-\alpha_{21}}{\alpha_{21}}$$

2. p (Huynh, 1978)

$$p = \sum_{x,x'=0}^{C-1} h(x,x') + \sum_{x,x'=C}^{n} h(x,x')$$

where h(x,x') is as previously defined.
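A minimal Python sketch of the computations in B follows. It is an illustration only: the function names are hypothetical, the beta function is evaluated through `math.lgamma`, and the example mean and variance are made-up values chosen so that KR-21 is positive and both beta parameters are positive.

```python
from math import comb, exp, lgamma

def log_beta(p, q):
    """Natural log of the beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)

def beta_params(n, mean, var):
    """Estimate a and b from the test mean, variance, and KR-21."""
    kr21 = n / (n - 1) * (1 - mean * (n - mean) / (n * var))
    ratio = (1 - kr21) / kr21
    return mean * ratio, (n - mean) * ratio

def h_uni(x, n, a, b):
    """Univariate negative hypergeometric density h(x)."""
    return comb(n, x) * exp(log_beta(a + x, n + b - x) - log_beta(a, b))

def h_biv(x, x2, n, a, b):
    """Bivariate density h(x, x') for two parallel n-item tests."""
    return (comb(n, x) * comb(n, x2)
            * exp(log_beta(a + x + x2, 2 * n + b - x - x2) - log_beta(a, b)))

def huynh_indices(n, mean, var, C):
    """Return (kappa, p, p1) at cutoff C."""
    a, b = beta_params(n, mean, var)
    upper = range(C, n + 1)
    p1 = sum(h_uni(x, n, a, b) for x in upper)
    p11 = sum(h_biv(x, x2, n, a, b) for x in upper for x2 in upper)
    p00 = sum(h_biv(x, x2, n, a, b) for x in range(C) for x2 in range(C))
    kappa = (p11 - p1 ** 2) / (p1 - p1 ** 2)
    return kappa, p00 + p11, p1

# Hypothetical 10-item test with mean 6.75 and variance 6.9375.
kappa, p, p1 = huynh_indices(n=10, mean=6.75, var=6.9375, C=7)
```

With two mastery states the same κ is obtained from the usual form (p − p_c)/(1 − p_c), where p_c = p₁² + (1 − p₁)² is the chance agreement, which gives a convenient consistency check on the two sums.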



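C. P (Subkoviak, 1976)

$$P = \frac{1}{N}\sum_{x=0}^{n} f(x)\left[P_x(X \ge C)^2 + \bigl(1 - P_x(X \ge C)\bigr)^2\right]$$

in which $P_x(X \ge C) = \sum_{j=C}^{n} \binom{n}{j}\,\theta_x^{\,j}\,(1-\theta_x)^{n-j}$

where θ_x represents the true score of a person with obtained score
of x and is estimated by

$$\theta_x = \alpha_{21}\left(\frac{x}{n}\right) + (1-\alpha_{21})\left(\frac{\bar{X}}{n}\right)$$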
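The Subkoviak estimate can be sketched in a few lines of Python. The function name, the example frequencies, and the reliability weight of 0.7 are hypothetical; the weight is passed in as a parameter, so any reliability estimate could be substituted for it.

```python
from math import comb

def subkoviak(f, C, alpha):
    """Agreement estimate from frequencies f[0..n], cutoff C, and
    reliability weight alpha used in the regression estimate."""
    n = len(f) - 1
    N = sum(f)
    mean = sum(x * fx for x, fx in enumerate(f)) / N
    total = 0.0
    for x, fx in enumerate(f):
        # Regression estimate of the proportion-correct true score.
        theta = alpha * x / n + (1 - alpha) * mean / n
        # Probability of reaching the cutoff on one n-item test.
        p_master = sum(comb(n, j) * theta ** j * (1 - theta) ** (n - j)
                       for j in range(C, n + 1))
        # Same decision on two independent parallel tests.
        total += fx * (p_master ** 2 + (1 - p_master) ** 2)
    return total / N

f = [0, 1, 1, 2, 3, 5, 8, 10, 9, 7, 4]   # hypothetical frequencies, n = 10
P_by_cutoff = [subkoviak(f, C, 0.7) for C in range(1, 11)]
```

Because each person contributes q² + (1 − q)², the estimate can never fall below 0.5, and it approaches 1 when the cutoff lies far from where the estimated true scores concentrate.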
D. β (Marshall, 1976)
In what follows,

y = a possible score on a hypothetical double-length test, 0 ≤ y ≤ 2n

f(y) = estimated frequency of score y in the (hypothetical) double-length
test score distribution

$$\beta = N^{-1}\left[\sum_{y=0}^{C-1} f(y) + \sum_{y=C}^{2C-2} f(y)\,H_y(y-[C-1],\,C-1) + \sum_{y=2C}^{n+C-1} f(y)\,H_y(C,\,y-C) + \sum_{y=n+C}^{2n} f(y)\right]$$

where H_y(u,v) is a partial sum of hypergeometric terms:

$$H_y(u,v) = \sum_{j=u}^{v} \frac{\binom{y}{j}\binom{2n-y}{n-j}}{\binom{2n}{n}}$$

Note that the second term in the brackets above vanishes when C=1 and
the third term vanishes when C=n.
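The bracketed sum for β can be sketched in Python as follows, taking the double-length frequencies f(y) as given. The function names and example frequencies are hypothetical, and the clamping of the summation limits inside `H` is a defensive assumption added for the sketch rather than part of the formula.

```python
from math import comb

def H(y, u, v, n):
    """Partial hypergeometric sum: probability that one half of a
    2n-item test scores between u and v, given total score y."""
    lo, hi = max(u, 0), min(v, n)
    return sum(comb(y, j) * comb(2 * n - y, n - j)
               for j in range(lo, hi + 1)) / comb(2 * n, n)

def marshall_beta(fy, C):
    """Mean split-half agreement from frequencies fy[0..2n] at cutoff C."""
    n = (len(fy) - 1) // 2
    N = sum(fy)
    total = 0.0
    for y, f in enumerate(fy):
        if y <= C - 1 or y >= n + C:
            agree = 1.0                        # every split is consistent
        elif y <= 2 * C - 2:
            agree = H(y, y - (C - 1), C - 1, n)   # both halves below C
        elif y >= 2 * C:
            agree = H(y, C, y - C, n)             # both halves at or above C
        else:
            agree = 0.0                        # y = 2C - 1: never consistent
        total += f * agree
    return total / N

# Hypothetical case: 5 persons, all with score 2 on a 6-item double test.
beta_example = marshall_beta([0, 0, 5, 0, 0, 0, 0], 2)
```

In the example, a total of 2 on six items is split consistently (one point in each half, both below C = 2) in 12 of the 20 possible splits, so the sketch returns 0.6.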

In this, f(y) can be estimated in a number of ways. We have chosen
five estimation techniques which correspond with the different models
for coefficient β discussed in the presentation.
1. Predictive Bayesian beta (Aitchison & Dunsmore, 1975)

$$f(y) = \sum_{x=0}^{n} f(x)\binom{2n}{y} B(a+x+y,\; 3n+b-x-y)/B(a+x,\; n+b-x)$$

where B is the beta function and a, b are estimated as in the Huynh
procedure.
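2. Compound binomial (Lord, 1965; adjusted for 2n items)

$$f(y) = \sum_{x=0}^{n} f(x)\, b(y;2n,\theta_x)\left[1 - \frac{kQ}{2n(2n-1)\,\theta_x(1-\theta_x)}\right]$$

where $b(y;2n,\theta_x) = \binom{2n}{y}\theta_x^{\,y}(1-\theta_x)^{2n-y}$,

$\theta_x = \alpha_{20}\left(\frac{x}{n}\right) + (1-\alpha_{20})\left(\frac{\bar{X}}{n}\right)$,

$Q = -2n(2n-1)\theta_x^2 + 2y(2n-1)\theta_x - y(y-1)$,

and k = [expression not legible in the source],

in which $S_p^2$ is the variance of the item difficulties.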

3. Beta distribution with parameters that are functions of the
obtained score distribution (similar to that used for Huynh's
coefficients, but adjusted for 2n items).

$$f(y) = N\binom{2n}{y} B(a+y,\; 2n+b-y)/B(a,b)$$
where B, a, b are as before.

4. Binomial regression (similar to that used by Subkoviak in his
index, but adjusted for 2n items)

$$f(y) = \sum_{x=0}^{n} f(x)\binom{2n}{y}\theta_x^{\,y}(1-\theta_x)^{2n-y}$$
where θ_x is as in Subkoviak's coefficient.



5. Averaged "double binomial"

This one was conjured up by the authors in an attempt to find
an f(y) estimate that does a better job than do most of the others
in echoing the modes of the obtained X distribution. Although
mathematically less defensible, its empirical properties are
generally good.

[Computing formula for model 5 not legible in the source. The surviving
fragments indicate terms in (x+k) and (x-k) and an integral of the form
$\int t^{r-1}(1-t)^{s-1}\,dt$.]



Other models were considered but rejected either as too complex and
not worth the trouble or on the grounds that preliminary research
showed them to have undesirable characteristics. The former
category includes various methods which involve arcsine transforma-
tions and some methods suggested by Wilcox (1978) (to whom we
acknowledge appreciation for suggesting the Aitchison & Dunsmore
reference); the latter category includes the unbiased (x/n) estimate
of θ and a predictive Bayesian model with vague priors. Preliminary
research also showed that model 3 outlined above fell in this category,
but it was retained for comparison purposes.
Indices which allow multiple mastery states. For this study, no more than
three mastery states were considered; the following formulas have been
simplified accordingly. In these formulas, K is the lower and C is the
upper cutoff score, 0 < K < C ≤ n.

1. κ(K,C) = (p_o - p_c)/(1 - p_c) (Huynh, 1978)

where
$$p_o = \sum_{x,x'=0}^{K-1} h(x,x') + \sum_{x,x'=K}^{C-1} h(x,x') + \sum_{x,x'=C}^{n} h(x,x')$$

and
$$p_c = \left[\sum_{x=0}^{K-1} h(x)\right]^2 + \left[\sum_{x=K}^{C-1} h(x)\right]^2 + \left[\sum_{x=C}^{n} h(x)\right]^2$$

where h is as defined for κ.

2. p(K,C) = p_o as defined above. (Huynh, 1978)



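3. P(K,C) (Subkoviak, 1977)

$$P(K,C) = \frac{1}{N}\sum_{x=0}^{n} f(x)\left[P_x(X<K)^2 + P_x(K \le X < C)^2 + P_x(X \ge C)^2\right]$$

where
$$P_x(u \le X < v) = \sum_{j=u}^{v-1}\binom{n}{j}\,\theta_x^{\,j}\,(1-\theta_x)^{n-j}$$

and θ_x is as defined for P.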

4. β(K,C) (Marshall, 1976)

$$\beta(K,C) = N^{-1}\left[\sum_{y=0}^{K-1} f(y) + \sum_{y=K}^{2K-2} f(y)\,H_y(y-[K-1],\,K-1) + \sum_{y=2K}^{2C-2} f(y)\,H_y(u,v) + \sum_{y=2C}^{n+C-1} f(y)\,H_y(C,\,y-C) + \sum_{y=n+C}^{2n} f(y)\right]$$

where f(y) depends on the model,

u = max(K, y-[C-1]),

v = min(C-1, y-K),

and H_y(u,v) is as defined for β.
Note that the second term in the brackets vanishes when K=1 and the
fourth term vanishes when C=n.

Calculation of Phi Coefficient

$$\varphi = \frac{AD - E^2}{(A+E)(D+E)}$$

where
$$A = \sum_{y=2C}^{n+C-1} f(y)\,H_y(C,\,y-C) + \sum_{y=n+C}^{2n} f(y)$$

$$D = \sum_{y=0}^{C-1} f(y) + \sum_{y=C}^{2C-2} f(y)\,H_y(y-[C-1],\,C-1)$$

$$E = (N-A-D)/2$$

The above expressions for A and D can be seen to be derived from the
formula for coefficient β.
Appendix C

M(C) (Brennan & Kane, 1977)

κ (Huynh, 1976)

p (Huynh, 1976)

P (Subkoviak, 1976)

β models 1, 2, 3, 4, and 5: variations of β (Marshall, 1976)

φ model 3



